Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add legal hold membership to device reporting #192

Merged
merged 23 commits into from
Mar 1, 2021

Conversation

maddie-vargo
Copy link
Contributor

@maddie-vargo maddie-vargo commented Dec 29, 2020

Description of Change

This PR addresses Phase I of #176 to add an --include-devices option to the legal-hold command to show devices associated with the legal hold custodians captured by the query. This option will also total storage held by these devices and print it in a table grouped by organization.

This PR requires pandas as a dependency. Python 3.5 does not support the pandas library, so this should not be merged until support ends for 3.5.

Testing Procedure

code42 legal-hold show 1234 --include-devices

PR Checklist
Did you remember to do the below?

  • Add unit tests to verify this change
  • Add an entry to CHANGELOG.md describing this change
  • Add docstrings for any new public parameters / methods / classes

@github-actions
Copy link

github-actions bot commented Dec 29, 2020

CLA Assistant Lite bot All contributors have signed the CLA ✍️

@maddie-vargo
Copy link
Contributor Author

I have read the CLA Document and I hereby sign the CLA

CHANGELOG.md Outdated
@@ -34,6 +34,9 @@ how a consumer would use the library (e.g. adding unit tests, updating documenta
- `search` to search for audit-logs.
- `send-to` to send audit-logs to server.

- `code42 legal-hold show` option:
Copy link
Contributor

@antazoey antazoey Jan 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do have a new devices list command that is merged by not yet released.

Would it make more sense to have a command code42 devices list --matter-id <UID> instead?

Just want to make sure all the options are considered, I don't really understand the use-case.

Copy link
Contributor Author

@maddie-vargo maddie-vargo Jan 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an interesting point. I chose to add it to legal-hold show because that command already showed details on a legal hold matter object, including users on legal hold and (optional) preservation policy details. I thought showing devices on legal hold would be a logical fit here.

From what I gather, the devices command renders mostly the Computer api resource. Unfortunately, there is no flag in the Computer api response that states whether a device is on legal hold. In order to pull legal hold devices, you first need to pull legal hold membership to get a list of users and then find the devices associated with those users. It would not be impossible to move this into the devices command, but we would still have to run some sort of legal hold py42 resource to determine the scope of devices.

Pinging @ceciliastevens for input as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or a code42 legal-hold list-devices [MATTER-ID]. Trying to find a way to afford output format options (At least table and csv).

Copy link
Contributor Author

@maddie-vargo maddie-vargo Jan 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, if we can find a way to structure this to add a format option, that would be helpful to most customers.

Currently, the legal-hold show command outputs data in the following formats:

  • Matter info - table
  • Users - columns
  • (optional) Preservation policy - JSON

What do you think about removing users from the show command and creating code42 legal-hold list-users [Matter ID] with an --include-devices optiion? I think it makes sense to keep users and devices closely tied together, rather than in separate commands. From a code perspective, to get devices, you'll have to get users first, anyway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that! @timabrmsn What are you thoughts on this? I believe you were the one who implemented the show command.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think it makes more sense to add legal hold status as an option to the devices list command, since that would be a useful option for people who are creating a report of all their devices to know which ones are on legal hold quickly when taking inventory of other things. It could just add a legalHoldMatterId column to the output.

And keeping users under legal-hold show still makes sense to me, since it's a high level overview of the matter, of which membership is a key detail. Otherwise the show command just becomes a single row of the list command with the option to also show preservation policy.

devices list already has --include-usernames which makes a separate call to api/Users and then merges the DataFrames on userId, so adding legal hold status would be easy enough to add using the same logic.

The storage total report is an interesting idea, but might be better suited to include in devices list report also, so it can be used more generically and not tied directly to only legal hold. I'm thinking something like --include-storage-totals on devices list that just adds a column per device for the sum of all that device's archives. Then if someone wanted to get a report broken down by matter Id, they could just filter by matter_id and sum the total in the csv report.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, devices it is. The device pull actually includes userUid without the username option, so it's even easier to merge in the legal hold data.

There are two main things that we want this to accomplish:

  • Limit the device list to devices on a specific legal hold matter
    devices --matter-uid - option to limit device search base on a specific legal hold matter, adds in columns such as legalHoldStatus, legalHoldMatterId, legalHoldMatterName

  • Add in legal hold membership columns for whatever device query you're running.
    devices -include-legal-hold-membership - option to option to include legal hold membership on to the devices table, adds same columns as above but doesn't limit the population

This feels a little redundant, so any feedback is appreciated.


As for storage-totals: is a sum of all device's archives appropriate? (this is how I wrote my first draft). However, the current legal hold script does some logic to determine which destination is the most recent and complete, and only uses that archive size as the total storage.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that since filtering by matter-id outside of the CLI is simple enough (i.e. grep/Select-String or directly in excel), can we leave that to the end-user? I'd like to not have our option list expand too much. And I have some ideas for more generic column/row filtering logic that could work on any cmd that outputs a dataframe. Something like `--filter 'matter-id == 42' that would convert to a pandas expression under the hood. Although I haven't started trying to implement that so I'm not sure how feasible it is...

Also, because a device can be part of multiple legal holds, we need to think of how to display that. Can we just have a single column that has a list of matter-uids the device is part of, maybe separated by semi-colons or something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, for storage totals I think adding a usedStorage column that sums the total combined archive size for a device would work, maybe we could also add an archiveCount column that gets added with that option?

Trying to keep what features we add as generic and multi-use-case friendly as possible.

)
if len(devices_dataframe.index) > 0:
echo("\nMatter Members and Devices:\n")
click.echo(devices_dataframe.to_csv())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We typically use table format for show commands.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

List commands, however, are known to support various output formats, inc. CSV.... My first comment on this PR suggests that this would make sense as a code42 devices list --matter-id <UID> command, which then you can tack on -f CSV or -f JSON etc.... Would that be better?

@@ -191,6 +459,11 @@ def preservation_policy_response(mocker):
return _create_py42_response(mocker, POLICY_RESPONSE)


@pytest.fixture
def devices_list_generator(mocker):
return [TEST_DEVICE_PAGE]
Copy link
Contributor

@antazoey antazoey Jan 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it is better to actually use generators for mocks when they are supposed to be rather than lists.. It saves future headaches, in my experience.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might need to use side_effect instead of return_value in that case, not sure

@maddie-vargo
Copy link
Contributor Author

I have committed new changes that remove the edits to legal-hold and move it all to the devices command.

  • If a device is on multiple legal holds, I concatenate the IDs, names, and status into a comma-separated list.
  • Since legal hold membership is determined at the user level, deactivated devices can print legal hold membership. It did not seem meaningful to present that for deactivated devices, so I cleared those cells.

As for the used-storage piece - this data is present if the user includes the --include-backup-usage option. In efforts to keep this as generalized as possible, is it too much to add in a --include-total-storage option that would bring in a column that totals device storage across destination? Re: Tim's comment above, this calculation could be left to the end user.

@kiran-chaudhary
Copy link
Contributor

Can we update the description please based on latest changes. I thought to try it out reading the description and then looking at the code it confused me more.

if df.empty:
click.echo("No results found.")
else:
formatter = DataFrameOutputFormatter(format)
formatter.echo_formatted_dataframe(df)


def _add_legal_hold_membership_to_device_dataframe(sdk, df):
matters = _get_all_active_matters(sdk)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Kiran above that we probably shouldn't use the legal-hold module's internal functions, just for the sake of if someone later needs to change them it could break stuff here.

Since we only need to iterate through all membership dicts, we can simplify getting them all with a generator function instead of building a new list and extending it:

def _get_all_active_hold_memberships(sdk):
    for page in sdk.legalhold.get_all_matters(active=True):
        for matter in page["legalHolds"]:
            for _page in sdk.legalhold.get_all_matter_custodians(
                legal_hold_uid=matter["legalHoldUid"], active=True
            ):
                for membership in _page["legalHoldMemberships"]:
                    yield membership

And pandas has a great function that can 'flatten' nested json into a dataframe for you, which can simplify a lot of this logic:

>>> df = pandas.json_normalize(_get_all_active_hold_memberships(sdk))
>>> df.head()
  legalHoldMembershipUid  active                   creationDate legalHold.legalHoldUid legalHold.name        user.userUid                   user.username                      user.email user.userExtRef
0     955061003769628092    True  2020-05-20T15:58:38.823-05:00     955060991270602172          Test2  945056771151950748                test@example.com                test@example.com            None
1     988814086101244865    True  2021-01-08T11:25:23.645-06:00     946193444244471969           Test  973619347957360307                legalhold@code42.com            legalhold@code42.com        None
2     955320624671252917    True  2020-05-22T10:57:44.903-05:00     946193444244471969           Test  955163198918558214                de60@example.com                de60@example.com            None
3     955320624637698485    True  2020-05-22T10:57:44.892-05:00     946193444244471969           Test  955163197341499910                de59@example.com                de59@example.com            None
4     955320624604144053    True  2020-05-22T10:57:44.885-05:00     946193444244471969           Test  955163195865104902                de58@example.com                de58@example.com            None

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

json_normalize is great for unpacking the nested dictionary. However, I seem to be losing records (1 per matter). If I run the generator response table through pandas.DataFrame, I get a table with more rows than I do running the same generator response through json_normalize. I haven't quite worked this out yet

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, I'll see if I get the same behavior and try to figure out what's happening too.

Copy link
Contributor Author

@maddie-vargo maddie-vargo Feb 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to serialize the generator into a list in order for all records to be included, which may defeat the purpose of the generator. I pushed those changes moments ago.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there was a recent pandas bug about the generator thing (they weren't expecting generators to be passed): pandas-dev/pandas#35923

It's fixed in 1.2.2, but I think we can probably just call list() on the result for now instead of force updating to the latest pandas.

df.loc[
df["status"] == "Deactivated",
["legalHoldMemberActive", "legalHoldUid", "legalHoldName"],
] = np.nan
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using np.nan for null values here, let's use an empty string, I think NaN would confuse people in the output and think maybe there was a bug. To replace any np.nan values resulting from the .groupby(), you can call .fillna(value="") on the df to replace those as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose NaN to match the value assumed by pandas of an empty value. NaN's are generated when the legal hold table is merged with the devices table. If I use the above solution prior to merging with the devices table, the legalHoldUid and legalHold.name columns will contain a mix of None's, actual values, and NaNs.

To confirm: which of the following would you prefer?

  • NaN's in table view
  • None's in table view

(Either will produce null values in the other output formats)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I meant to say "resulting from the.merge()" instead of .groupby(). Yeah, the .fillna(value="") needs to be called after the merge.

I think we should convert NaN/None values to an empty string for table/CSV outputs, but we'd probably want to keep them as NaN/None when we output to json so it translates to null. But we should push that logic to the DataFrameOutputFormatter so we only need to handle it in one place for any commands that use DataFrames.

I'll start a new PR to rework the formatter to handle that, since I'd like to refactor it a bit for clarity and add some tests around handling dataframe nulls.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NaNs removed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I've opened #245 to add the null handling in the DataFrameOutputFormatter, so we don't need to be concerned with converting them here, since we want the conversion to be conditional on the output format chosen by the end-user.

Sorry to have you flip/flop this!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved, and awaiting #245

@maddie-vargo maddie-vargo changed the title Add legal-hold option to show devices and storage Add legal hold membership to device reporting Feb 5, 2021
@maddie-vargo
Copy link
Contributor Author

maddie-vargo commented Feb 16, 2021

All PR comments should be resolved. (Waiting on NaN resolution in #245)

The PR does not include a per-org storage calculation. The --include-backup-usage options reports archive size per archive. Is it preferred to create another option for --include-storage-totals, or rely on the user to calculate these totals outside of the CLI?

@timabrmsn
Copy link
Contributor

Hmm, it seems that the --include-backup-usage logic was broken somewhere, since that option now doesn't actually add anything to the output..

@maddie-vargo
Copy link
Contributor Author

maddie-vargo commented Feb 17, 2021

I'm not able to recreate that behavior - the option spits out backup usage data on my branch. Diffing with the upstream code42cli master doesn't show any changes to backup usage logic.

Maybe something changed in py42? I'm using py42 v. 1.11.0 on my end

@timabrmsn
Copy link
Contributor

Nope, it was just a device with a really large backup usage result towards the bottom of the list that was pushing that column way to the right.

I think a separate option + column for total storage a device is using would be really helpful. I also am thinking --include-backup-usage would probably be better turned into a separate command like list-archives, where it's one row per archive like our list-backup-sets. Right now parsing any specific data out of the backupUsage column is difficult for our end-users.

But that can be a separate PR...

I know a big part of the goal of this PR is to enable legal hold storage reporting, right? If we add an --include-storage-totals option that ensures include_backup_usage is True on the api call (but not necessarily on the table output), and then extracts and sums the total archiveBytes for each device and sticks it in a totalUsedStorage column, is that close enough for your customers who want these reports?

@timabrmsn
Copy link
Contributor

@maddie-vargo PR #245 just merged

@maddie-vargo
Copy link
Contributor Author

--include-total-storage option now pushed up

@timabrmsn
Copy link
Contributor

I think this is good, just the small thing of filtering out V2 archives from the storage totals. I'll write up a ticket for testing with all the acceptance criteria and make sure I haven't missed anything else as I go through documenting each change.

@timabrmsn
Copy link
Contributor

json_normalize names the unpacked nested columns with dotted notation legalHold.legalHoldUid, legalHold.name can we rename this to just legalHoldUid and legalHoldName?

@@ -466,8 +471,9 @@ def _break_backup_usage_into_total_storage(backup_usage):
total_storage = 0
archive_count = 0
for archive in backup_usage:
archive_count += 1
total_storage += archive["archiveBytes"]
if archive["archiveFormat"] == "ARCHIVE_V1":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be if archive["archiveFormat"] != "ARCHIVE_V2":, since while V3 archives are not prevalent, they still should show up in the results.

@timabrmsn timabrmsn marked this pull request as ready for review February 24, 2021 19:44
Copy link
Contributor

@kiran-chaudhary kiran-chaudhary left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from change-log comment LGTM.

CHANGELOG.md Outdated
### Added

- `code42 devices list` option:
- `--include-legal-hold-membership` to add legal hold columns to the result output, printing the legal hold matter name and ID for any active device actively on legal hold
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we rephrase printing .... to something like prints the legalhold matter name and ID for any active device on legal hold!?

atleast active device actively seems off while reading.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that is terrible phrasing. Resolved with latest commit

CHANGELOG.md Outdated
@@ -58,6 +58,7 @@ how a consumer would use the library (e.g. adding unit tests, updating documenta
- `search` to search for audit-logs.
- `send-to` to send audit-logs to server.


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like our changelog has gotten messed up? The #Unreleased tag should not be there...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@unparalleled-js I re-pulled from master. The #Unreleased tag should be at the top, right? Let me know if I should fix it with this PR

@timabrmsn timabrmsn merged commit 65c5426 into code42:master Mar 1, 2021
@github-actions github-actions bot locked and limited conversation to collaborators Mar 1, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants